class: center, middle, inverse, title-slide

.title[
# Introduction to Simple Linear Regression
]
.author[
### .font1[.]
]
.author[
### .font110[Zhaohu (Jonathan) Fan]
]
.author[
### .font70[Information Technology Management]
]
.author[
### .font70[Scheller College of Business]
]
.author[
### .font70[Georgia Institute of Technology]
]
.date[
### .font70[February 3, 2023]
]

---

# Previous experiences with regression

</br>
</br>

.font140[
* I have heard of regression before
]

--

.font140[
* I have used regression before in a class or at work
]

--

.font140[
* I know what adjusted `\(R^2\)` means
]

---

# Main topics

</br>
</br>

.font140[
* Simple linear regression (SLR)
]

--

.font140[
* Least squares (LS) estimation
]

--

.font140[
* Fitting SLR models in R with the `lm()` function
]

---

# Regression: What is it?

</br>
</br>

.font140[
* Simply: The most widely used statistical tool for understanding relationships among variables
]

--

.font140[
* The relationship is expressed in the form of an equation or a model connecting the outcome to the factors
]

---

# Regression in business

</br>
</br>

.font140[
* Optimal portfolio choice:
  - .red[Predict] the future joint distribution of asset returns
  - .blue[Construct] an optimal portfolio (choose weights)
]

--

.font140[
* Determining price and marketing strategy:
  - .red[Estimate] the effect of price and advertising on sales
  - .blue[Decide] on the optimal price and ad campaign
]

---

# Regression in everything

</br>
</br>

.font140[
* .blue[Straight prediction questions:]
  - What price should I charge for my car?
  - What will the interest rates be next month?
]

--

.font140[
* .blue[Explanation and understanding:]
  - Does your income increase if you get a master's degree?
  - Is my advertising campaign working?
]

---

# Example: Predicting House Prices

.center[
<img src="images/Zillow-GA.png" width="650" height="450" >
]

.center[
.font80[
Image of Atlanta, GA by Zillow
]
]

---

# Example: Predicting House Prices

.font140[
**Problem**:

- Predict market price based on observed characteristics
]

--

.font140[
**Solution**:

- Look at property sales data where we know the price and some observed characteristics.
- Build a decision rule that predicts price as a function of the observed characteristics.
]

--

.font140[
**Action**:

- We have to define the variables of interest and develop a specific quantitative measure for each of them
]

---

# What characteristics should we use?

</br>

.font140[
* Many factors or variables affect the price of a house
  - size of house
  - number of baths
  - garage
  - size of land
  - location, etc.
]

--

.font140[
* Price and size are easy to quantify, but what about other variables such as location, aesthetics, and workmanship?
]

---

# Simple linear regression (SLR)

.font140[
* To keep things super simple, let’s focus only on the .blue[size] of the house.
]

--

.font140[
* The variable that we use to guide prediction is the .blue[explanatory (or input)] variable, and this is labelled
  - .blue[X=size of house] (e.g. thousands of square feet)
]

--

.font140[
* The value that we seek to predict is called the .red[dependent (or output)] variable, and we denote this as
  - .red[Y=price of house] (e.g. thousands of dollars)
]

---

# Example: Ames housing data

.font140[
* Appears to be a linear relationship: .blue[as size goes up, price goes up].
]

.center[
<img src="images/house-data.png" width="400" height="400" >
]

---

# “Eyeball” method

.font120[
* Appears to be a linear relationship: .blue[as size goes up, price goes up].
* Fitting a line by the “eyeball” method:
]

.center[
<img src="images/house-data-1.png" width="400" height="400" >
]

---

# “Eyeball” method

.font120[
* Appears to be a linear relationship: .blue[as size goes up, price goes up].
* Fitting a line by the “eyeball” method:
]

.center[
<img src="images/house-data-2.png" width="400" height="400" >
]

---

# Linear prediction

.font140[
* Recall that the equation of a line is:
`$$Y = b_0 + b_1 X$$`
where `\(b_0\)` is the intercept and `\(b_1\)` is the slope.
]

.font140[
* The intercept value is in units of Y ($1,000).
* The slope is in units of Y per unit of X ($1,000 per 1,000 sq ft).
]

---

# Interpretation of coefficients

.font140[
* Suppose the line is:
`$$\text{Price of house} = 3 + 120 \times \text{size of house}$$`
]

.font140[
* **Slope** is 120:
  - The average price of a house increases by an estimated $120 for every one-square-foot increase in size (equivalently, $120,000 per 1,000 sq ft).
]

--

.font140[
* **Intercept** is 3:
  - The estimated average price ($3,000) of a house with zero square feet.
]

--

.font140[
* **Does interpreting the intercept make sense in this problem?**
]

---

# What is a good line?

</br>

.font160[
.center[.red[Can we do better than the eyeball method?]
]
]

--

.font120[
* We desire a strategy for estimating the slope and intercept parameters in the model `\(\hat{Y} = b_0 + b_1 X\)`.
]

--

.font120[
That involves

* choosing a .red[criterion], i.e., quantifying how good a line is
]

--

.font120[
* and matching that with a .blue[solution], i.e., finding the best line under that criterion.
]

---
class: clear

.font140[
A reasonable goal is to minimize the size of all residuals:
]

--

.font120[
* The residual `\(e_i\)` is the signed vertical distance from the .red[red solid line] to the .blue[observed value]: `\(e_i = Y_i - \widehat{Y_i}\)`.
* The red solid line gives our .red[predictions or fitted values]: `\(\widehat{Y_i} = b_0 + b_1 X_i\)`.
]

.center[
<img src="images/LS.png" width="360" height="360" >
]

---

# Least Squares (LS)

.font120[
The line fitting process:

* Assign a penalty to every residual (positive or negative), e.g., `\(e_i^2\)`
* Trade-off between moving closer to some points and at the same time moving away from other points.
]

--

.font120[
* Least squares chooses `\(b_0\)` and `\(b_1\)` to minimize
`$$\sum^N_{i=1} e_i^2 = \sum^N_{i=1} (Y_i - \widehat{Y_i}) ^2 = \sum^N_{i=1} (Y_i - [b_0 + b_1 X_i ]) ^2$$`
]

---

# R's built-in lm() function

.font120[
* The `lm()` function can be used to fit the SLR model (or any LM for that matter!)
  - In R, type `?lm` to view the associated documentation/help page
]

--

.font120[
* The statement `lm(y ~ x, data = df)` fits an SLR model by regressing `y` on `x`, where `y` and `x` are columns in `df`
]

--

.font120[
* To suppress the intercept term, use `y ~ x - 1` (not often necessary)
]

---

# Example: Ames housing data

.font120[
Fit an SLR model to the Ames housing data using `Price` as the response and `Size` as the predictor, and interpret the estimated coefficients.
]

.pull-left[
.font90[.purple[**Code**]]
.code80[
```R
set.seed(750)                             # for reproducibility
data(ames, package = "modeldata")         # Load the data (if not already loaded)
ames$Price <- ames$Sale_Price / 1000      # rescale response
ames$Size <- ames$Gr_Liv_Area / 1000      # rescale predictor
ids <- sample.int(nrow(ames), size = 50)  # rows to select at random
ames.trn <- ames[ids, ]                   # training (or model building) data
*fit <- lm(Price ~ Size, data = ames.trn)#<<  # Fit an SLR model to the data
summary(fit)                              # print a more verbose summary
```
]]

--

.pull-right[
.font90[.purple[**Output**]]
.code55[
```R
Call:
lm(formula = Price ~ Size, data = ames.trn)

Residuals:
    Min      1Q  Median      3Q     Max
-95.591 -27.706  -5.042  28.520 174.538

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
*(Intercept)   -22.45      23.33  -0.962    0.341
*Size          137.18      14.96   9.173 3.96e-12 ***
---
Signif. 
codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 48.36 on 48 degrees of freedom
*Multiple R-squared:  0.6367, Adjusted R-squared:  0.6292
F-statistic: 84.14 on 1 and 48 DF,  p-value: 3.958e-12
```
]]

---

# Example: Ames housing data

.font120[
The estimated model is:
`$$\text{Price of house} = -22.45 + 137.18 \times \text{size of house}$$`
]

--

.font120[
**Slope** is 137.18

- The average price of a house increases by an estimated $137.18 for every one-square-foot increase in size
]

--

.font120[
**Intercept** is -22.45

- `\(\widehat{\mathbb{E}}\left(\text{Price}|\text{Size} = 0\right)=-22.45\)`
- Does interpreting the intercept make sense in this problem?
]

--

.font120[
`\(R^2\)` is 63.7%

- About 63.7% of the variation in house price is explained by house size (the 62.9% in the output is the adjusted `\(R^2\)`)
]

---

# The fitted LS line

```r
tail(cbind(ames.trn$Price, "fitted_values" = fitted(fit)))
##      fitted_values
## 45 150      194.5631
## 46 226      242.1637
## 47 161      166.5789
## 48 106      112.9426
## 49 132      151.4894
## 50 373      320.0805
```

---

# Residuals: `\(Y_i - \widehat{Y}_i\)`

.pull-left[
.center[
<img src="images/LS.png" width="360" height="360" >
]
]

.pull-right[
.font90[.purple[**Code**]]
.code60[
```r
library(ggplot2)  # provides ggplot() and the geoms used below
ggplot(ames.trn, aes(x = Size, y = Price)) +
  geom_point(size = 3, color = "blue") +          # observed values
*geom_smooth(method = "lm", formula = y ~ x,
*            se = FALSE, alpha = 0.5, color = "red") +
  # vertical segments from each fitted value to its observed value (residuals)
  geom_segment(aes(x = Size, y = fitted(fit), xend = Size, yend = Price),
               alpha = 0.75, size = 1, col = "black") +
  geom_point(aes(x = Size, y = fitted(fit)), color = "black", size = 2) +
  labs(x = "Size (K)", y = "Price (K)", title = "Ames housing data") +
  theme(axis.title = element_text(face = "bold")) +
  theme(axis.title.y = element_text(face = "bold")) +
  theme(text = element_text(size = 18)) +
  theme(plot.title = element_text(face = "bold", size = 18))
```
]
]

---

# Steps in a regression analysis

.font100[
- Step 1. State the problem
]

--

.font100[
- Step 2. Data collection
]

--

.font100[
- Step 3. 
Model fitting & estimation (this class)
  * Model specification (linear? logistic?)
  * Select potentially relevant variables
  * Model fitting (least squares)
  * Model validation and criticism
  * Back to 3.1? Back to 2?
]

--

.font100[
- Step 4. Answering the posed questions
  * But that oversimplifies a bit;
      + it is more iterative, and can be more art than science
]

---
class: clear,middle,center

</br>

.font220[
Thank you!
]

</br>

.font120[
Zhaohu (Jonathan) Fan

PhD Candidate in Business Analytics

fanzh@ucmail.uc.edu
]

---

# Simple linear regression

* .font150[Data]: `\(\LARGE \left\{\left(X_i, Y_i\right)\right\}_{i=1}^n\)`

* .font150[Model]: `\(\LARGE Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\)`

  - `\(\LARGE Y_i\)` .font150[is a continuous response]

  - `\(\LARGE X_i\)` .font150[is a continuous predictor]

  - `\(\LARGE \beta_0\)` .font150[is the intercept of the regression line]

  - `\(\LARGE \beta_1\)` .font150[is the slope of the regression line]

  - `\(\LARGE \epsilon_i \stackrel{iid}{\sim} N\left(0, \sigma^2\right)\)`

---

# .font90[More examples of statistical relationships]

</br>

- .font150[Simple linear regression]: `\(\LARGE Y = \beta_0 + \beta_1 X + \epsilon\)`

- .font150[Multiple linear regression]: `\(\LARGE Y = \beta_0 + \sum_{i=1}^p \beta_i X_i + \epsilon\)`

- .font150[Polynomial regression]: `\(\LARGE Y = \beta_0 + \sum_{i=1}^p \beta_i X^i + \epsilon\)`

- .font150[Nonlinear regression]: `\(\LARGE Y = \frac{\beta_1 X}{\left(\beta_2 + X\right)} + \epsilon\)`

- .font150[and more.]
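
---

# Least squares estimates in closed form

.font120[
The least squares criterion from earlier in the deck has a well-known closed-form minimizer. As a sketch, assume data `\(\left\{\left(X_i, Y_i\right)\right\}_{i=1}^n\)` and write `\(\bar{X}\)` and `\(\bar{Y}\)` for the sample means (notation introduced here for illustration). Setting the partial derivatives of `\(\sum_{i=1}^n (Y_i - [b_0 + b_1 X_i])^2\)` with respect to `\(b_0\)` and `\(b_1\)` to zero gives the normal equations:
]

`$$\sum_{i=1}^n (Y_i - b_0 - b_1 X_i) = 0, \qquad \sum_{i=1}^n X_i \, (Y_i - b_0 - b_1 X_i) = 0$$`

.font120[
Solving them, provided the `\(X_i\)` are not all equal:
]

`$$b_1 = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X}$$`

.font120[
Up to rounding, these should agree with the `(Intercept)` and slope estimates that `lm()` reports for an SLR fit.
]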